1. Citation

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at:
[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

2. About dataset

This analysis aims to understand how various physicochemical parameters relate to the quality ratings of both red and white wine. The data sets used for the analysis were downloaded from https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv and https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv

3. Number of Instances:

red wine - 1599; white wine - 4898.

4. Number of Attributes:

11 + output attribute

5. Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

6. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Load Packages

# Packages used in this EDA
library(ggplot2)
library(gridExtra)
## Loading required package: grid
library(GGally)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:GGally':
## 
##     nasa
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%

Load Data

Rd <- read.csv('wineQualityReds.csv') #1599 obs. of 13 variables
Wd <- read.csv('wineQualityWhites.csv') #4898 obs. of 13 variables

# add a categorical variable to both sets -- there are 14 variables now
Rd['color'] <- 'red'
Wd['color'] <- 'white'

# merge red wine and white wine datasets
wine <- rbind(Rd, Wd)

# creates a wine dataset of 6497 obs. of 14 variables
dim(wine)
## [1] 6497   14
# gets the names of variables in the dataset
names(wine)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "color"
# internal structure of wine
str(wine)
## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : chr  "red" "red" "red" "red" ...
# Summary of the dataset
summary(wine)
##        X        fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1   Min.   : 3.80   Min.   :0.08     Min.   :0.000  
##  1st Qu.: 813   1st Qu.: 6.40   1st Qu.:0.23     1st Qu.:0.250  
##  Median :1650   Median : 7.00   Median :0.29     Median :0.310  
##  Mean   :2044   Mean   : 7.21   Mean   :0.34     Mean   :0.319  
##  3rd Qu.:3274   3rd Qu.: 7.70   3rd Qu.:0.40     3rd Qu.:0.390  
##  Max.   :4898   Max.   :15.90   Max.   :1.58     Max.   :1.660  
##  residual.sugar    chlorides     free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.60   Min.   :0.009   Min.   :  1.0       Min.   :  6         
##  1st Qu.: 1.80   1st Qu.:0.038   1st Qu.: 17.0       1st Qu.: 77         
##  Median : 3.00   Median :0.047   Median : 29.0       Median :118         
##  Mean   : 5.44   Mean   :0.056   Mean   : 30.5       Mean   :116         
##  3rd Qu.: 8.10   3rd Qu.:0.065   3rd Qu.: 41.0       3rd Qu.:156         
##  Max.   :65.80   Max.   :0.611   Max.   :289.0       Max.   :440         
##     density            pH         sulphates        alcohol    
##  Min.   :0.987   Min.   :2.72   Min.   :0.220   Min.   : 8.0  
##  1st Qu.:0.992   1st Qu.:3.11   1st Qu.:0.430   1st Qu.: 9.5  
##  Median :0.995   Median :3.21   Median :0.510   Median :10.3  
##  Mean   :0.995   Mean   :3.22   Mean   :0.531   Mean   :10.5  
##  3rd Qu.:0.997   3rd Qu.:3.32   3rd Qu.:0.600   3rd Qu.:11.3  
##  Max.   :1.039   Max.   :4.01   Max.   :2.000   Max.   :14.9  
##     quality        color          
##  Min.   :3.00   Length:6497       
##  1st Qu.:5.00   Class :character  
##  Median :6.00   Mode  :character  
##  Mean   :5.82                     
##  3rd Qu.:6.00                     
##  Max.   :9.00

Observations from the summary

1. The alcohol content varies from 8.00 to 14.90 across the samples in the dataset.
2. The quality of the samples ranges from 3 to 9, with a median of 6 and a mean of 5.818.
3. The range of fixed acidity is wide, with a minimum of 3.8 and a maximum of 15.9.
4. The pH value varies from 2.720 to 4.010, with a mean of 3.219 and a median of 3.210.
5. Mean residual sugar is 5.443 but the maximum is 65.800, which suggests an outlier.
6. free.sulfur.dioxide has a mean of 30.53 and a maximum of 289.0.
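
One common way to check whether a maximum such as 65.8 is an outlier is Tukey's 1.5 x IQR rule. The sketch below applies it to the residual sugar quartiles printed in the summary above. The rule is brought in here purely for illustration; it is not something the original analysis used.

```r
# Quartiles and maximum of residual.sugar, copied from the summary() output above
q1 <- 1.80
q3 <- 8.10
max_value <- 65.80

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

upper_fence
max_value > upper_fence
```

Under this rule the maximum lies well above the upper fence, which is consistent with treating it as an outlier.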

Univariate Plots and Analysis Section

Analysis of all the single variables using plots

#summarize the fixed.acidity for red and white wine

summary(Wd$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.80    6.30    6.80    6.85    7.30   14.20
summary(Rd$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
#fixed.acidity distribution of wine 
ggplot(wine, aes(x = fixed.acidity, fill=color)) +
  geom_bar(colour="black",position="dodge") +
  ggtitle('fixed.acidity distribution for wine')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk fixed.acidity
Observation about fixed acidity of wine:

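Red wine tends to have higher fixed acidity than white wine (mean 8.32 vs. 6.85).

The white wine bars are taller over most of the range because the sample contains about three times as many white wines (4898) as red wines (1599), so the raw counts do not by themselves show that white wine is more acidic.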
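
Because the two groups differ in size (1599 red vs. 4898 white), raw counts can be hard to compare. A minimal sketch of converting counts into within-color proportions follows; the data frame `toy`, its values and the 7.5 cut point are all hypothetical and are not taken from the wine data.

```r
# Hypothetical toy data: 4 red and 8 white samples
toy <- data.frame(
  color = c(rep("red", 4), rep("white", 8)),
  fixed.acidity = c(7.4, 8.1, 9.2, 11.0,
                    6.1, 6.3, 6.8, 6.9, 7.0, 7.2, 7.4, 8.3)
)

# Bin acidity into "low" (<= 7.5) and "high" (> 7.5); the cut point is arbitrary
toy$level <- cut(toy$fixed.acidity, breaks = c(-Inf, 7.5, Inf),
                 labels = c("low", "high"))

counts <- table(toy$color, toy$level)
# margin = 1 normalizes each row, so the proportions for each color sum to 1
props <- prop.table(counts, margin = 1)

counts
props
```

In this toy example white has more "low" samples in absolute terms simply because there are more white samples; the proportions make the two colors directly comparable.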

#Create a function to be used in the univariate analysis to avoid repetition

uplot <- function(dataset, x, y, gtitle,opts=NULL) {
  ggplot(dataset, aes_string(x = x, fill = y)) +
  geom_bar(colour="black",position="dodge")  +
  ggtitle(gtitle)
}
#volatile.acidity distribution of wine
summary(wine$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.08    0.23    0.29    0.34    0.40    1.58
uplot(wine, "volatile.acidity", "color","volatile.acidity distribution for wine")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk volatile.acidity

The volatile.acidity distribution is slightly skewed, so a log10 scale (scale_x_log10) is used to analyze it further.

#using scale_x_log10 to deal with skew in the volatile.acidity spread
uplot(wine, "volatile.acidity", "color","volatile.acidity distribution for wine") +  
  scale_x_log10(breaks = seq(min(wine$volatile.acidity), max(wine$volatile.acidity), 0.1)) 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk volatile.acidity_log10

## Adjusted bin width
ggplot(wine, aes(x = volatile.acidity, fill=color)) +
  geom_bar(colour="black",position="dodge",binwidth = 0.01) +
  scale_x_log10(breaks = seq(min(wine$volatile.acidity), max(wine$volatile.acidity), 0.1)) +
  ggtitle('volatile.acidity distribution for wine with adjusted bin width')
## Warning: position_dodge requires constant width: output may be incorrect

plot of chunk volatile.acidity_log10
Observation about Volatile acidity of wine:

On the log10 scale, volatile.acidity appears approximately normally distributed.

Most volatile.acidity values fall between 0.23 and 0.78.
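
To illustrate why a log10 scale helps with right-skewed data, the sketch below compares a simple moment-based skewness before and after the transformation. The vector `x` is made up for illustration and the `skewness` helper is a hypothetical convenience function, not part of the original analysis.

```r
# Hypothetical right-skewed values (not taken from the wine data)
x <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.5, 0.8, 1.6)

# Moment-based skewness: mean cubed deviation divided by the cubed
# (population) standard deviation
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

c(raw = skew_raw, log10 = skew_log)
```

For these values the skewness is clearly smaller after the log10 transformation, which is the effect scale_x_log10 has visually on the histogram.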

#citric.acid level in wine
uplot(wine, "citric.acid", "color","citric.acid distribution for wine") 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk citric.acid

The citric.acid distribution is slightly skewed, so a log10 scale (scale_x_log10) is used to analyze it further.

summary(wine$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.250   0.310   0.319   0.390   1.660
#using scale_x_log10 to deal with skew in the  citric.acid data
ggplot(wine, aes(x = citric.acid, fill=color)) +
  geom_histogram() +
  scale_x_log10() +
  ggtitle('citric.acid distribution for wine by log10')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk citric.acid.log10

#using scale_x_continuous as there are some gaps in the plot
ggplot(wine, aes(x = citric.acid, fill=color)) +
  geom_histogram(binwidth = 0.01) +
  scale_x_continuous(breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)) +
  ggtitle('citric.acid distribution for wine by x continuous')
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk citric.acid.log10
Observation about Citric acidity of wine:

citric.acid does not appear to be normally distributed on a logarithmic scale.

Since the distribution is not normal, the minimum is 0.0, and the plot shows close to 150 observations at 0.0, I wanted to see how many observations were either not reported or had a value of 0.

length(subset(wine, citric.acid == 0)$citric.acid)
## [1] 151

There are 151 observations with a value of 0.
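
These zeros matter for the log10 plot: in R, log10(0) evaluates to -Inf, so zero-valued observations cannot be placed on a log10 axis. A small sketch with made-up values (the vector `citric` is hypothetical):

```r
# Hypothetical citric.acid values, including two exact zeros
citric <- c(0, 0, 0.04, 0.25, 0.31, 0.56)

n_zero <- sum(citric == 0)            # observations that are exactly 0
logged <- log10(citric)               # log10(0) evaluates to -Inf
n_dropped <- sum(!is.finite(logged))  # values with no position on a log axis

c(zeros = n_zero, non_finite = n_dropped)
```

This is one likely reason the log10 histogram of citric.acid looks incomplete: the zero-valued wines have no finite position on that axis.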

#residual.sugar level in wine
uplot(wine, "residual.sugar", "color","residual.sugar distribution for wine") 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk residual.sugar

Use scale_x_continuous to further analyze the data

summary(wine$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.60    1.80    3.00    5.44    8.10   65.80
ggplot(wine, aes(x = residual.sugar, fill=color)) +
  geom_bar(colour="black",position="dodge",binwidth = 1) +
  scale_x_continuous(limits = c(0.6, 66)) +
  ggtitle('residual.sugar distribution for wine by x continuous')

plot of chunk residual.sugar.sum

There is an outlier at around 65; the majority of values are between 0.6 and 21.

ggplot(wine, aes(x = residual.sugar, fill=color)) +
  geom_bar(position="dodge",binwidth =0.1) +
  scale_x_continuous(limits = c(0.6, 21)) +
  ggtitle('residual.sugar distribution for wine, outliers excluded')
## Warning: position_dodge requires constant width: output may be incorrect

plot of chunk residual.sugar.majority


Observation about residual sugar of wine:

White wine’s residual.sugar extends to about 20, whereas red wine’s residual.sugar extends to only about 5. So some of the white wines seem to be sweeter than the red wines.

#Chloride level in wine
uplot(wine, "chlorides", "color","chloride levels in wine") 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk chloride

Chloride levels seem to be skewed, so a log10 scale is used to analyze them further.

summary(wine$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.009   0.038   0.047   0.056   0.065   0.611
ggplot(wine, aes(x = chlorides, fill=color)) +
  geom_bar(colour="black",position="dodge", binwidth = 0.01) +
  scale_x_log10(breaks = seq(min(wine$chlorides), max(wine$chlorides), 0.09)) +
  ggtitle('chloride levels in wine by log10')
## Warning: position_dodge requires constant width: output may be incorrect

plot of chunk chloride.log10


Observation about Chlorides in wine:

Most white wines have lower chloride levels than red wines. There are some outliers in the red wine chloride levels.

#free.sulfur.dioxide level in wine
uplot(wine, "free.sulfur.dioxide", "color","free sulfur dioxide distribution for wine")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk free.SO2

The free sulfur dioxide data seems to be skewed, so a log10 scale is used to analyze it further.

summary(wine$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    17.0    29.0    30.5    41.0   289.0
ggplot(wine, aes(x = free.sulfur.dioxide, fill=color)) +
  geom_histogram(binwidth = 0.025,colour="black",position="dodge") +
  scale_x_log10(breaks = c(1, 3, 5, 7, 10, 20, 50,300)) +
  ggtitle('free sulfur dioxide distribution for wine by log10')

plot of chunk free.SO2.log10


Observation about free sulfur dioxide in wine:

White wines tend to have higher levels of free sulfur dioxide than red wines. There is an outlier for white wine at 289.0.

#Amount of total.sulfur.dioxide in wine
uplot(wine, "total.sulfur.dioxide", "color","total sulfur dioxide distribution for wine")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk total.SO2

The total sulfur dioxide data seems to be skewed, so a log10 scale is used to analyze it further.

summary(wine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6      77     118     116     156     440
ggplot(wine, aes(x = total.sulfur.dioxide, fill=color)) +
  geom_histogram(binwidth = 0.025,colour="black",position="dodge") +
  scale_x_log10(breaks = c(1, 3, 5, 7, 10, 20, 50,100,200,350)) +
  ggtitle('total sulfur dioxide distribution for wine by log10')

plot of chunk total.SO2.log10


Observation about total sulfur dioxide in wine:

As with free sulfur dioxide, white wines tend to have higher levels of total sulfur dioxide than red wines. There are some outliers for white wine around 350.

#Density of wine
uplot(wine, "density", "color","density for wine")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk density

The density data seems to be skewed, so a log10 scale is used to analyze it further.

summary(wine$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.987   0.992   0.995   0.995   0.997   1.040
ggplot(wine, aes(x = density, fill=color)) +
  geom_histogram(colour="black",binwidth = 0.0002) +
  scale_x_log10(breaks = seq(min(wine$density), 1.0490, 0.002))  +
  ggtitle('density of wine by log10')
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk density.log10


Observation about density in wine:

There is an outlier at 1.03911, and a few more values lie between 1.00911 and 1.01111.

The density distribution of white wine is bimodal, while that of red wine is approximately normal.

#pH level in wine

uplot(wine, "pH", "color","pH of wine") 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk pH

summary(wine$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.72    3.11    3.21    3.22    3.32    4.01


Observation about pH in wine:

The pH values appear to be normally distributed, with most white wine samples falling between 3.0 and 3.5.

#sulphates in wine

uplot(wine, "sulphates", "color","sulphates in wine") 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk sulphates

summary(wine$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.220   0.430   0.510   0.531   0.600   2.000
#further analyze the data by plotting scale_x_continuous and also set the binwidth
ggplot(wine, aes(x = sulphates, fill=color)) +
  geom_histogram(binwidth = 0.01,colour="black") +
  scale_x_continuous(limits = c(0.25, 1.5))  +
  ggtitle('sulphates in wine by x continuous')
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk sulphates

ggplot(wine, aes(x = sulphates, fill=color)) +
  geom_histogram(binwidth = 0.01) +
  scale_x_log10(breaks = c(0.2,0.4,0.6,0.8,1.0,1.2,1.4,1.6,1.8,2.0))  +
  ggtitle('sulphates in wine by log10')
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk sulphates


Observation about sulphates in wine:

There are some gaps in the data: either no data was gathered for those sulphate values, or wines do not have those sulphate values.

#Alcohol level in wine

uplot(wine, "alcohol", "color","Alcohol content in wine")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk alcohol

summary(wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0     9.5    10.3    10.5    11.3    14.9
ggplot(wine, aes(x = alcohol, fill=color)) +
  geom_histogram(binwidth = 0.05) +
  scale_x_continuous(breaks = seq(8,15,0.5), limits = c(8,15)) +
  ggtitle('Alcohol content in wine by x continuous')
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk alcohol
Observation about Alcohol in wine:

Red and white wines have a similar alcohol distribution pattern.

The peak is around 9.5 for both red and white wine.

#Quality of wine
##Ref.: http://statistics.ats.ucla.edu/stat/r/dae/tobit.htm
summary(wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00
# for the histogram: count = density * sample size * bin width
f <- function(x, var, bw = 1) {
  dnorm(x, mean = mean(var), sd(var)) * length(var)  * bw
}

# setup base plot (binwidth is a layer parameter, not an aesthetic)
p <- ggplot(wine, aes(x = quality, fill=color)) +
       geom_bar(colour="black",position="dodge") +
       ggtitle('Quality of wine')

# histogram, colored by wine color,
# with a normal distribution overlaid
p + stat_bin(binwidth=1) +
  stat_function(fun = f, size = 1,
    args = list(var = wine$quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk quality
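
The scaling used in f() (count = density * sample size * bin width) can be checked numerically: summed over unit-wide bins, the scaled normal curve should add up to roughly the number of observations. A minimal sketch, using a made-up vector `q` and a hypothetical helper `expected_count` that mirrors f():

```r
# Hypothetical integer quality scores (not the wine data)
q <- c(4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8)

# Same scaling as f() above: density * sample size * bin width
expected_count <- function(x, var, bw = 1) {
  dnorm(x, mean = mean(var), sd = sd(var)) * length(var) * bw
}

# Evaluate the scaled curve at the centre of each unit-wide bin
centres <- 0:10
curve_counts <- expected_count(centres, q)

sum(curve_counts)  # close to length(q), the total number of observations
length(q)
```

This is why the overlaid curve sits at the same height as the bars instead of near zero, as an unscaled density would.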

#create a categorical variable and rate wine quality as bad, average and good
wine$quality_rating <- ifelse(wine$quality < 5, 'bad', 
                              ifelse(wine$quality < 7, 'average', 'good'))
wine$quality_rating <- ordered(wine$quality_rating,levels = c('bad', 'average', 'good'))

summary(wine$quality_rating)
##     bad average    good 
##     246    4974    1277
# quality_rating is a factor, so its levels are counted with geom_bar()
ggplot(wine, aes(x = quality_rating, fill=color)) +
  geom_bar()  +
  ggtitle('Wine quality rating')

plot of chunk quality
Observation about Quality of wine:

The distribution of wine quality appears to be roughly normal, with a peak at 5 and 6.

I also created a new variable, quality_rating, which classifies the wines into bad, average and good buckets based on quality. The majority fell in the average bucket.
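
As an alternative to nested ifelse() calls, the same buckets can be built with cut(). The sketch below is only a suggestion, not the method used above, and it checks that both approaches agree on the integer scores 3 to 9 observed in this dataset.

```r
quality <- 3:9  # the range of scores observed in this dataset

# Nested ifelse(), as in the analysis above
nested <- ifelse(quality < 5, "bad",
                 ifelse(quality < 7, "average", "good"))

# cut() with right-closed intervals: (-Inf,4] bad, (4,6] average, (6,Inf] good
binned <- cut(quality, breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

data.frame(quality, nested, binned)
```

The two agree only because quality is an integer score; for non-integer values, the breaks of 4 and 6 would need to be reconsidered.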

Univariate Analysis

Did you create any new variables from existing variables in the dataset?

I created a new variable, quality_rating, which classifies the wines into bad, average and good buckets based on the quality score.

Of the features you investigated, were there any unusual distributions?

The density distribution of white wine is bimodal, while that of red wine is approximately normal.

Did you perform any operations on the data to tidy, adjust, or change the form of the data?

I did not tidy the data, but to analyze some of the skewed variables I plotted them on a log10 scale.

Bivariate Plots and Analysis Section

pairs(wine)

Reviewing the pairs plot for strong correlations, we set aside two pairs before analyzing the rest:

total.sulfur.dioxide vs. free.sulfur.dioxide (corr = 0.721): ignored because free SO2 is part of total SO2.
free.sulfur.dioxide vs. residual.sugar (corr = 0.403): ignored because the correlation between total.sulfur.dioxide and residual.sugar is higher.

Ref.:http://www.inside-r.org/packages/cran/psych/docs/pairs.panels

cwine <- wine
cwine$color <- ifelse(cwine$color=="red", 1, 2)

# index the fill colours by the numeric colour code (1 = red, 2 = white)
pairs.panels(cwine,bg=c("orange","yellow")[cwine$color],
   pch=21,main="Wine by color",hist.col="green")

plot of chunk correlation

A few of the top correlation pairs are:

alcohol vs. density (corr = -0.69)
density vs. residual.sugar (corr = 0.55)
total.sulfur.dioxide vs. residual.sugar (corr = 0.50)
density vs. fixed.acidity (corr = 0.46)
quality vs. alcohol (corr = 0.44)
total.sulfur.dioxide vs. volatile.acidity (corr = -0.41)
chlorides vs. sulphates (corr = 0.40)
chlorides vs. volatile.acidity (corr = 0.38)
citric.acid vs. fixed.acidity (corr = -0.38)
density vs. chlorides (corr = 0.36)
alcohol vs. residual.sugar (corr = -0.36)
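
A ranked list like this does not have to be read off the plot. The sketch below shows one way to build it with base R; the data frame `toy` and its values are made up for illustration. On the real data the same steps would need to be run on the numeric columns only, for example `wine[sapply(wine, is.numeric)]`.

```r
# Hypothetical numeric data (values are illustrative, not from the wine set)
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  density = c(0.998, 0.997, 0.996, 0.994, 0.992, 0.991),
  sugar   = c(2.1, 6.0, 1.9, 8.2, 2.5, 1.8)
)

cm <- cor(toy)  # Pearson correlation matrix

# Keep each pair once by taking only the upper triangle
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  corr = cm[idx],
  stringsAsFactors = FALSE
)

# Sort by absolute correlation, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$corr)), ]
pairs_df
```

Sorting by absolute value keeps strong negative correlations, such as alcohol vs. density, at the top of the list.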

Further analyzing the elements that affect wine.

#create a variable quality_factor to analyze various levels of quality
wine$quality_factor <- factor(wine$quality, levels=c(0,1,2,3,4,5,6,7,8,9,10))
quality_min <- min(wine$quality)
quality_max <- max(wine$quality)
quality_mean <- mean(wine$quality)
quality_median <- median(wine$quality)
quality_iqr <- IQR(wine$quality)
# quartiles come from quantile(); median -/+ IQR does not give Q1/Q3 in general
quality_q1 <- unname(quantile(wine$quality, 0.25))
quality_q3 <- unname(quantile(wine$quality, 0.75))

summary(wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00

Helper functions to plot how each variable relates to quality_factor:

#boxplot helper reused throughout this section of the analysis
quplot <- function(dataset, y, z, yinter, ylbl, gtitle) {
  ggplot(dataset, aes_string(x="quality_factor", y=y, fill=z)) +
  geom_boxplot() +
  geom_hline(show_guide=T, yintercept=yinter, linetype='longdash', alpha=.5, color='blue') +
  geom_vline(xintercept = quality_mean-quality_min+1, linetype='longdash', color='blue', alpha=.5) +
  xlab("Wine Quality") +
  ylab(ylbl) +
  ggtitle(gtitle)
  }

#scatter plot helper reused throughout this section of the analysis
qucol <- function(dataset,y,yinter, ylbl, gtitle) {
  ggplot(data=dataset, aes_string(x="quality", y=y)) +
  geom_jitter(alpha=1/3) +
  geom_smooth(method='lm', aes(group = 1))+
  geom_hline(yintercept=yinter, linetype='longdash', alpha=.5, color='blue') +
  geom_vline(xintercept = quality_mean, linetype='longdash', color='blue', alpha=.5) +
  xlab("Wine Quality") +
  ylab(ylbl) +
  ggtitle(gtitle) +
  facet_wrap(~color)  
}

The quality of wine vs. alcohol is shown using box plots, as alcohol plays an important role in the microbial stabilization of both red and white wine.

summary(wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0     9.5    10.3    10.5    11.3    14.9
alcohol_mean   <- mean(wine$alcohol)
alcohol_median <- median(wine$alcohol)

To analyze the relationship between alcohol and quality, let us look at the mean alcohol value at each quality level.

tapply(wine$alcohol, wine$quality, mean)
##      3      4      5      6      7      8      9 
## 10.215 10.180  9.838 10.588 11.386 11.679 12.180

Alcohol by quality level, with the mean alcohol and mean quality marked by dashed lines:

summary(wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0     9.5    10.3    10.5    11.3    14.9
quplot(wine, "alcohol", "color", alcohol_mean, "Alcohol", "Alcohol impact on wine Quality")

plot of chunk bivar_alcohol_visual

Observation about Alcohol vs. Quality of Wine:

For both red and white wine, the quality levels above the mean quality of 5.818 mostly show alcohol values above the mean alcohol value of 10.49.

In our sample, only white wines reach the highest quality of 9.

The same information can be viewed as a scatter plot:

qucol(wine,"alcohol",alcohol_mean, "Alcohol", "Alcohol impact on Wine Quality based on color")

plot of chunk bivar_scat_alcohol
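
The straight line drawn by geom_smooth(method='lm') in qucol() is an ordinary least squares fit of the y variable on quality. A minimal sketch of the equivalent lm() call, on made-up data in which alcohol rises by exactly 0.5 per quality point (the data frame `toy` is hypothetical):

```r
# Hypothetical data: alcohol rises by exactly 0.5 per quality point
toy <- data.frame(quality = c(3, 4, 5, 6, 7, 8))
toy$alcohol <- 9 + 0.5 * toy$quality

# geom_smooth(method = 'lm') fits y ~ x by default, here alcohol ~ quality
fit <- lm(alcohol ~ quality, data = toy)
coef(fit)
```

The `quality` coefficient is the slope of the line. On the real data, fitting the model separately for red and white wines would correspond to the two facets of the plot.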

The quality of wine vs. residual sugar is displayed using box plots, as sugar is an essential component in the production of wine.

During alcoholic fermentation, yeast feeds on the sugar found in grape juice and converts it to ethyl alcohol, or ethanol, and carbon dioxide. The amount of sugar fermented determines the wine’s alcohol level and the amount of residual sugar left in the wine.

Ref: https://winemakermag.com/501-measuring-residual-sugar-techniques

summary(wine$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.60    1.80    3.00    5.44    8.10   65.80
ressugar_mean <- mean(wine$residual.sugar)
ressugar_median <- median(wine$residual.sugar)

To analyze the relationship between residual.sugar and quality, let us look at the mean residual.sugar value at each quality level.

tapply(wine$residual.sugar, wine$quality, mean)
##     3     4     5     6     7     8     9 
## 5.140 4.154 5.804 5.550 4.732 5.383 4.120

Residual sugar by quality level, with the mean residual.sugar and mean quality marked by dashed lines:

quplot(wine, "residual.sugar", "color", ressugar_mean, "residual.sugar", "Residual sugar impact on Wine Quality")

plot of chunk bivar_rs_visual

Observation about residual.sugar vs. Quality of Wine:

Red wine has less residual.sugar than white wine, and its quality does not appear to be affected by residual.sugar.

White wine of the highest quality (9) has residual.sugar below the mean residual.sugar value.

The same information can be viewed as a scatter plot:

qucol(wine,"residual.sugar",ressugar_mean, "Residual Sugar", "Residual impact on Wine Quality based on color")

plot of chunk bivar_scat_rs

White wine has higher residual.sugar than red wine.

Interesting fact: A winemaker who wishes to make a wine with high levels of residual sugar (like a dessert wine) may stop fermentation early, either by dropping the temperature of the must to stun the yeast or by adding a high level of alcohol (like brandy) to the must to kill off the yeast and create a fortified wine.
Ref.: http://en.wikipedia.org/wiki/Fermentation_in_winemaking

The quality of wine vs. chlorides is examined next. Chlorides act as preserving agents in liquid enzyme preparations, which in turn are important for the microbiological stability of wines.
Ref.: http://www.westchesterwinemakers.com/2010/06/03/enzymes-in-winemaking-do-we-use-them-damm-straight-we-do/

summary(wine$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.009   0.038   0.047   0.056   0.065   0.611
chlorides_mean <- mean(wine$chlorides)
chlorides_median <- median(wine$chlorides)

To analyze the relationship between chlorides and quality, let us look at the mean chloride value at each quality level.

tapply(wine$chlorides, wine$quality, mean)
##       3       4       5       6       7       8       9 
## 0.07703 0.06006 0.06467 0.05416 0.04527 0.04112 0.02740

Chlorides by quality level, with the mean chloride level and mean quality marked by dashed lines:

quplot(wine, "chlorides", "color", chlorides_mean, "Chlorides", "Chlorides impact on Wine Quality")

plot of chunk bivar_cl_visual

Observation about Chlorides vs. Quality of Wine:

For both red and white wine, lower chloride levels go with higher quality.

Red wine has more chloride content than white wine. White wine’s chloride content is mostly below the mean chloride level.

The same information can be viewed as a scatter plot:

qucol(wine,"chlorides",chlorides_mean, "Chlorides", "Chlorides impact on Wine Quality based on Color")

plot of chunk bivar_cl_scat

White wine has lower chloride levels than red wine.

The quality of wine vs. density is shown using box plots.

Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content.

Ref.: https://answers.yahoo.com/question/index?qid=20140527020443AALJISW
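
As a rough illustration of how the density difference maps to alcohol content, a rule of thumb often quoted in home winemaking and brewing is ABV ≈ (original gravity − final gravity) × 131.25. The formula and the gravity readings below are assumptions for illustration; they are not taken from the dataset or from the cited reference.

```r
# Hypothetical specific gravity readings (illustrative values only)
original_gravity <- 1.090  # must before fermentation: sugar, no alcohol
final_gravity    <- 0.995  # finished wine: little sugar, more alcohol

# Rule-of-thumb conversion; 131.25 is an approximate empirical factor
abv_percent <- (original_gravity - final_gravity) * 131.25
abv_percent
```

A larger drop in density during fermentation therefore corresponds to a higher alcohol content, which fits the negative correlation between density and alcohol seen in this data.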

summary(wine$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.987   0.992   0.995   0.995   0.997   1.040
density_mean <- mean(wine$density)
density_median <- median(wine$density)

To analyze the relationship between density and quality, let us look at the mean density value at each quality level.

tapply(wine$density, wine$quality, mean)
##      3      4      5      6      7      8      9 
## 0.9957 0.9948 0.9958 0.9946 0.9931 0.9925 0.9915

Density by quality level, with the mean density and mean quality marked by dashed lines:

quplot(wine, "density", "color", density_mean, "Density", "Density impact on Wine Quality")

plot of chunk bivar_dens_visual

Observation about Density vs. Quality of Wine:

For both red and white wine, lower density goes with higher quality.

Red wine is denser than white wine.

The same information can be viewed as a scatter plot:

qucol(wine, "density", density_mean, "Density", "Density impact on Wine Quality based on color")

plot of chunk bivar_dens_scat

In our sample, most white wines fall between quality 4.5 and 7.5; only a few have a high quality of 8.

In our sample of red wines, the majority are between quality 4.5 and 6.5; only some are at quality level 7 and very few at 8.

bioth <- function(dataset, y, gtitle) {
  ggplot(dataset,aes_string(x = "quality", y = y)) + 
    geom_point(aes_string(color="color"),alpha=1/4, position = 'jitter') +
    ggtitle(gtitle)
  }
bioth(wine, "total.sulfur.dioxide","Total SO2 and Quality Relationship" )

plot of chunk bivar_so2_quality

bioth(wine, "fixed.acidity","fixed.acidity and Quality Relationship" )

plot of chunk bivar_fixed.acidity_quality

bioth(wine, "sulphates","Sulphates and Quality Relationship" )

plot of chunk bivar_sulphates_quality

As the SO2 vs. quality, sulphates vs. quality and fixed.acidity vs. quality graphs show:

The quality of wine ranges from 4.5 to 7.5 for both red and white wine, irrespective of the SO2, sulphates or fixed.acidity level.

Very few white wines are of high quality, and these variables seem to have no impact on quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol correlates strongly with wine quality: as alcohol content increases, wine quality increases.

Red wine quality is not impacted by residual.sugar, and red wine has less residual.sugar. White wines of the highest quality (9) have residual.sugar below the mean residual.sugar value.

White wine has higher residual.sugar than red wine.

For both red and white wine, a lower chloride level goes with higher quality.

For both red and white wine, lower density goes with higher quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between some elements varies with the color of the wine:

density vs. fixed.acidity,
chlorides vs. sulphates,
fixed.acidity vs. citric.acid and
residual.sugar vs. alcohol.

What was the strongest relationship you found?

Alcohol vs. quality is the strongest relationship I found for both wines in the given data.

Multi-variate Plots and Analysis Section

The pairs of variables below are plotted against each other and faceted by wine quality_rating:

# use function for plotting with ggplot for simplicity of code
plot <- function(dataset, x, y, z, gtitle, opts=NULL) {
  ggplot(dataset, aes_string(x = x, y = y, color = z)) +
    geom_point(alpha = 1/5, position = position_jitter(h = 0), size = 2) +
    facet_wrap(~quality_rating) +
    geom_smooth(method = 'lm') +
    ggtitle(gtitle)
}
# density vs. alcohol(corr - -0.69)
p <- plot(wine, "density", "alcohol", "color","Density vs. Alcohol correlation") 
   
p + coord_cartesian(xlim=c(min(wine$density),1.005), ylim=c(8,15))

plot of chunk density_alcohol

The correlation between alcohol and density is strong for both white and red wines

# residual.sugar vs. density (corr - 0.55)
p <- plot(wine,  "residual.sugar", "density", "color","Residual.sugar vs. Density correlation") 

p + coord_cartesian(xlim=c(min(wine$residual.sugar),25), 
                    ylim=c(min(wine$density), 1.005))

plot of chunk residual.sugar_density

The correlation between residual.sugar and density is strong for white and red wines.

#residual.sugar vs. total.sulfur.dioxide (corr - 0.50)
p <- plot(wine, "residual.sugar", "total.sulfur.dioxide", "color","residual.sugar vs. total.SO2 correlation")

p + scale_x_log10() +
    coord_cartesian(xlim=c(min(wine$residual.sugar),30), 
                    ylim=c(min(wine$total.sulfur.dioxide), 350))

plot of chunk residual.sugar_total.sulfur.dioxide

The correlation between residual.sugar and total.sulfur.dioxide is weak for white and red wine.

# density vs. fixed.acidity(corr - 0.46)
p <- plot(wine, "density", "fixed.acidity", "color","Density vs. fixed.acidity correlation")

p + coord_cartesian(xlim=c(min(wine$density),1.005))

plot of chunk density_fixed.acidity

The correlation between density and fixed.acidity is strong for red wine and none for white wines.

# alcohol vs. quality (corr - 0.44)
p <- plot(wine, "quality", "alcohol", "color","Quality vs. Alcohol correlation") 

p + coord_cartesian(xlim=c(2,9.5),
                    ylim=c(min(wine$alcohol),15))

plot of chunk quality_alcohol

The correlation between alcohol and quality is strong for red and white wines.

summary(wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00
#total.sulfur.dioxide vs volatile.acidity(corr - -0.41)
summary(wine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6      77     118     116     156     440
p <- plot(wine, "total.sulfur.dioxide", "volatile.acidity", "color","total.SO2 vs. volatile.acidity correlation") 

p + coord_cartesian(xlim=c(50,275))

plot of chunk total.sulfur.dioxide_volatile.acidity

There is no correlation between volatile.acidity and total.sulfur.dioxide for red and white wines.

#chlorides vs sulphates (corr - 0.40)
p <- plot(wine, "chlorides", "sulphates", "color","chlorides vs. sulphates correlation") 

p + scale_x_log10() +
    coord_cartesian(ylim=c(min(wine$sulphates), 1))

plot of chunk chlorides_sulphates

The correlation between chlorides and sulphates is strong for red and none for white wines.

#chlorides vs volatile.acidity (corr - 0.38)
p <- plot(wine, "chlorides", "volatile.acidity", "color","chlorides vs. volatile.acidity correlation") 

p + scale_x_log10() 

plot of chunk chlorides_volatile.acidity

There is no correlation between chlorides and volatile.acidity for red and white wines.

#citric.acid vs fixed.acidity(corr - -0.38)
summary(wine$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.250   0.310   0.319   0.390   1.660
plot(wine, "fixed.acidity", "citric.acid", "color","fixed.acidity vs. citric.acid correlation") 

plot of chunk citric.acid_fixed.acidity

The correlation between fixed.acidity and citric.acid is strong for red wines. For white wines, it weakens going from the bad to the good quality rating.

#chlorides vs. density (corr - 0.36)
p <- plot(wine, "chlorides", "density", "color","chlorides vs. density correlation") 

p + scale_x_log10() + 
  coord_cartesian(ylim=c(min(wine$density), 1.005)) 

plot of chunk chlorides_density

The correlation between chlorides and density is strong for red and white wines.

#residual.sugar vs. alcohol  (corr - -0.36)
p <- plot(wine, "residual.sugar", "alcohol", "color","residual.sugar vs. Alcohol correlation") 

p + coord_cartesian(xlim=c(min(wine$residual.sugar), 25),
                    ylim=c(min(wine$alcohol),15)
                    )

plot of chunk residual.sugar_alcohol

The correlation between alcohol and residual.sugar is strong for white wines and weak to none for red wines.

So in summary (S = strong, W = weak, N = none; Corr is the magnitude of the correlation in the combined sample):

Element pairs                               Red   White   Corr
alcohol vs. density                         S     S       0.69
residual.sugar vs. density                  S     S       0.55
residual.sugar vs. total.sulfur.dioxide     W     W       0.50
density vs. fixed.acidity                   S     N       0.46
quality vs. alcohol                         S     S       0.44
volatile.acidity vs. total.sulfur.dioxide   N     N       0.41
chlorides vs. sulphates                     S     N       0.40
volatile.acidity vs. chlorides              N     N       0.38
fixed.acidity vs. citric.acid               S     W       0.38
chlorides vs. density                       S     S       0.36
residual.sugar vs. alcohol                  N     S       0.36

From the above it is evident that the following correlations depend on the color of the wine:

density vs. fixed.acidity,
chlorides vs. sulphates,
fixed.acidity vs. citric.acid and
residual.sugar vs. alcohol.
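
One way to check whether a pairwise correlation depends on color is to compute it separately within each color group. The sketch below does this on a tiny made-up data frame (`toy`, with invented values) using a hypothetical helper `cor_by_color`. With the report's data, the same call would take `wine` and two of its column names.

```r
# Toy data with invented values, for illustration only (not the wine sample).
toy <- data.frame(
  color = rep(c("red", "white"), each = 5),
  fixed.acidity = c(6, 7, 8, 9, 10, 6, 7, 8, 9, 10),
  density = c(0.994, 0.995, 0.996, 0.997, 0.998,
              0.995, 0.993, 0.996, 0.993, 0.995)
)

# Hypothetical helper: Pearson correlation of columns x and y within each
# level of the `color` column. Returns a named numeric vector.
cor_by_color <- function(dataset, x, y) {
  groups <- split(dataset, dataset$color)
  sapply(groups, function(g) cor(g[[x]], g[[y]]))
}

by_color <- cor_by_color(toy, "fixed.acidity", "density")
round(by_color, 2)
```

In this toy example the pair is perfectly correlated in one group and uncorrelated in the other, which is the kind of pattern an S/N row in the summary describes.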

# correlation functions to be used in drawing graphs when analyzing red and white wines.

rwcorr <- function(dataset, x,y, gtitle) {
  ggplot(dataset, aes_string(x=x, y=y))+
  geom_point(size = 3.5, aes_string(color="quality_factor")) +
  scale_color_brewer(type = 'div') +
  ggtitle(gtitle)
  
}

rwcorrs <- function(dataset, x,y, gtitle) {
ggplot(dataset, aes_string(x = x, y = y)) +
  geom_jitter(alpha = 0.9, aes_string(color = "quality_factor")) +
  geom_smooth(method = "lm", color = "blue") +
  ggtitle(gtitle)
}

Since the number of red wines is about one third of the number of white wines in the sample, the correlations between elements in the combined sample follow the white wines rather than the red.
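
The effect of unequal group sizes on a pooled correlation can be illustrated with a small deterministic example. The data below are invented (they are not the wine sample): a small "red" group with no correlation between `x` and `y`, and a "white" group three times as large with a perfect correlation. All names here are hypothetical.

```r
# Deterministic toy data (invented values, not the wine sample).
# "red" group: 4 points with zero correlation between x and y.
red <- data.frame(color = "red", x = 1:4, y = c(1, -1, -1, 1))
# "white" group: 12 points (three times as many) with a perfect correlation.
white <- data.frame(color = "white", x = rep(1:4, times = 3),
                    y = rep(1:4, times = 3) - 2.5)
pooled <- rbind(red, white)

r_red <- cor(red$x, red$y)
r_white <- cor(white$x, white$y)
r_pooled <- cor(pooled$x, pooled$y)

round(c(red = r_red, white = r_white, pooled = r_pooled), 2)
```

The pooled value lands closer to the correlation of the larger group, which is why the two colors are also analyzed separately.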

Below we analyze some of the key correlations for red wine.

#Create a subset red wine data from cwine 
cRd <-  subset(cwine,color %in% c(1))

pairs.panels(cRd,pch=21,main="Red wine",hist.col="green")
## Warning: the standard deviation is zero

plot of chunk red_panel
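
The "the standard deviation is zero" warning is what base R's `cor()` emits when a column is constant. That is plausibly the case here because the subset keeps a single `color` value, although this is an assumption about how `cwine` is built. A minimal sketch on made-up data, with a hypothetical helper `drop_constant_columns`, shows one way to remove such columns before plotting or correlating:

```r
# Toy stand-in for a single-color subset (invented values).
toy_red <- data.frame(
  color = rep(1, 5),                       # constant within the subset
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0),
  density = c(0.998, 0.997, 0.996, 0.995, 0.994)
)

# Hypothetical helper: keep only the columns that take more than one value.
drop_constant_columns <- function(dataset) {
  varies <- vapply(dataset, function(col) length(unique(col)) > 1, logical(1))
  dataset[, varies, drop = FALSE]
}

kept <- drop_constant_columns(toy_red)
names(kept)
cor(kept)   # no constant column left, so no NA entries
```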

In the case of red wine, the top correlations are between the following elements:

Element pairs                       Corr
fixed.acidity vs. pH                -0.68
fixed.acidity vs. citric.acid        0.67
fixed.acidity vs. density            0.67
volatile.acidity vs. citric.acid    -0.55
citric.acid vs. pH                  -0.54
density vs. alcohol                 -0.50

#create quality factor
cRd$quality_factor <- as.factor(cRd$quality)
rwcorr(cRd, "fixed.acidity","pH", "fixed.acidity vs. pH correlation for Red")

plot of chunk r_fixed.acidity_pH

As you can see, the pH level decreases as acidity increases. The correlation between pH and fixed.acidity is negative and does not show a clear relationship to quality.

rwcorr(cRd, "fixed.acidity","citric.acid", "fixed.acidity vs. citric.acid correlation for Red")

plot of chunk r_fixed.acidity_citric.acid

Wines of quality level 5 are concentrated between fixed.acidity levels 6 and 10 and citric.acid levels 0 and 0.37. As fixed.acidity increases, the citric.acid level in red wine increases. Quality level 7 has a higher citric.acid content, indicating that higher-quality red wines contain more citric.acid.

rwcorrs(cRd, "fixed.acidity","density", "fixed.acidity vs. density correlation for Red")

plot of chunk r_fixed.acidity_density

Quality of red wine increases along with the increase in the concentration of fixed.acidity and density.

rwcorrs(cRd, "volatile.acidity","citric.acid", "volatile.acidity vs. citric.acid correlation for Red")

plot of chunk r_volatile.acidity_citric.acid

The correlation between volatile.acidity and citric.acid is negative; that is, as volatile.acidity increases, the citric.acid of red wine decreases.

The majority of the wines with high levels of citric.acid are at quality level 7, and those with lower levels fall in the quality level 5 range.

This supports the earlier observation that the level of citric.acid in red wine contributes to its quality.

While fixed.acidity has a positive impact on wine quality, volatile.acidity seems to have a negative impact.

rwcorrs(cRd, "citric.acid","pH", "citric.acid vs. pH correlation for Red")

plot of chunk r_citric.acid_pH

The pH and citric.acid correlation does not seem to impact the quality of red wine one way or the other.

rwcorrs(cRd, "density","alcohol", "density vs. alcohol correlation for Red")

plot of chunk r_density_alcohol

summary(cRd$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.4     9.5    10.2    10.4    11.1    14.9

The majority of red wines with a quality factor of 7 have an alcohol content above 10.

Quality of red wine is good, when

Fixed.acidity is less
Citric.acid is high
Alcohol is high

#create a subset of white wine data from cwine
cWd <-  subset(cwine,color %in% c(2))

pairs.panels(cWd,pch=21,main="White wine",hist.col="green")
## Warning: the standard deviation is zero

plot of chunk white_panel
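
The strongest pairs can also be read off a correlation matrix directly rather than from the panel plot. The sketch below uses invented values and a hypothetical helper `top_pairs`; it is not the code used to produce the figures in this report.

```r
# Toy data with invented values, for illustration only.
toy <- data.frame(
  residual.sugar = c(1.2, 5.0, 8.1, 12.3, 15.0, 2.0),
  density        = c(0.991, 0.994, 0.996, 0.998, 1.000, 0.992),
  alcohol        = c(12.5, 11.0, 10.2, 9.5, 9.0, 12.0),
  pH             = c(3.2, 3.1, 3.3, 3.0, 3.25, 3.15)
)

# Hypothetical helper: list variable pairs ordered by absolute correlation.
top_pairs <- function(dataset, n = 3) {
  m <- cor(dataset)
  idx <- which(upper.tri(m), arr.ind = TRUE)   # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    corr = m[idx]
  )
  pairs <- pairs[order(-abs(pairs$corr)), ]
  head(pairs, n)
}

top_pairs(toy, n = 3)
```

Sorting by the absolute value keeps strong negative correlations, such as density vs. alcohol, near the top of the list.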

In the case of white wine, the top correlations are between the following elements:

Element pairs                        Corr
residual.sugar vs. density            0.84
density vs. alcohol                  -0.78
total.sulfur.dioxide vs. density      0.53
residual.sugar vs. alcohol           -0.45
total.sulfur.dioxide vs. alcohol     -0.45
pH vs. fixed.acidity                 -0.43

#create the quality factor
cWd$quality_factor <- as.factor(cWd$quality)
rwcorrs(cWd, "residual.sugar","density", "residual.sugar vs. density correlation for white")

plot of chunk white_residual.sugar_density

The white wine quality is high when the density of wine is less.

rwcorrs(cWd, "density","alcohol", "density vs. alcohol correlation for white")

plot of chunk w_density_alcohol

The white wine quality is high when alcohol is high but the correlation between alcohol and density is negative. This again confirms our above finding about density.

rwcorrs(cWd, "total.sulfur.dioxide","density", "total.SO2 vs. density correlation for white")

plot of chunk w_totalso2_dens

Contribution of total.sulfur.dioxide towards quality is inconclusive.

rwcorrs(cWd, "residual.sugar","alcohol", "residual.sugar vs. alcohol correlation for white")

plot of chunk w_residual.sugar_alcohol

The white wine quality is higher when residual.sugar is less.

rwcorrs(cWd, "total.sulfur.dioxide","alcohol", "total.SO2 vs. alcohol correlation for white")

plot of chunk w_total.so2_alcohol

The quality of white wine is high when total.sulfur.dioxide is < 250 and alcohol content is high.

rwcorrs(cWd, "pH","fixed.acidity", "pH vs. fixed.acidity correlation for white")

plot of chunk w_pH_fixed.acidity

The correlation of pH vs. fixed.acidity in relation to quality is inconclusive.

Quality of white wine is good, when

Density is less
residual.sugar is less
alcohol is high

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationship between alcohol and density is strong and negative, and higher alcohol combined with lower density goes with higher wine quality.

In the case of white wine, the strongest correlation (positive) is between residual.sugar and density.

In the case of red wine, the strongest correlation (negative) is between fixed.acidity and pH.

Were there any interesting or surprising interactions between features?

The correlation between some of the elements depended on the color of the wine.

Final Plots and Summary

densp <- function(dataset, x, gtitle) {
  # Use the dataset passed in rather than the global `wine` data frame
  ggplot(data = dataset, aes_string(x = x, fill = "quality_factor")) +
    geom_density() +
    ggtitle(gtitle)
}

Plot One

#summarize the quality variable
summary(wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00
# In the given sample how many of them fall in each of the quality level.
table(wine$quality)
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5
# tabling red and white wine separately to view their distribution since the sample does not have equal number of red and white wine samples.
table(cRd$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
table(cWd$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
#Given samples quality distribution
ggplot(wine) + geom_density(aes(x=quality, fill=color))

plot of chunk Plot1

Description One

The distribution of wine quality appears to be approximately normal. Quality peaks at 5 for red wine and at 6 for white wine. When the tabled data for red and white are reviewed separately, red wine appears to be bimodal.
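
The peaks can be confirmed from the tabled counts rather than read off the density plot. The sketch below copies the counts from the `table()` output above; the variable names are introduced here for illustration.

```r
# Quality counts copied from the table() output above.
red_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
white_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                  "7" = 880, "8" = 175, "9" = 5)

# Most frequent quality level in each color.
red_peak <- names(which.max(red_counts))
white_peak <- names(which.max(white_counts))
c(red = red_peak, white = white_peak)

# Share of each quality level within its color, in percent.
round(100 * prop.table(red_counts), 1)
round(100 * prop.table(white_counts), 1)
```

For red wine the counts at levels 5 and 6 (681 and 638) are close to each other, which is worth keeping in mind when reading the shape of the red density curve.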

Plot Two

summary(wine$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.987   0.992   0.995   0.995   0.997   1.040
#Boxplot depicting density of wine based on the color of wine
quplot(wine, "density", "color", density_mean, "Density", "density relationship with wine Quality")

plot of chunk Plot2

# Density distribution of the wines in each quality bucket
densp(wine,"density","density relationship with wine Quality")

plot of chunk Plot2

Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content.
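
The idea that the drop in density tracks alcohol production can be made concrete with a rule of thumb from home brewing: ABV is approximately (OG - FG) x 131.25, where OG and FG are the specific gravity before and after fermentation. The formula, the function name `estimate_abv` and the example gravities are not taken from the dataset or from [Cortez et al., 2009]; they are assumptions used only to illustrate the principle.

```r
# Rule-of-thumb estimate of alcohol by volume (%) from the drop in
# specific gravity during fermentation. Approximation only.
estimate_abv <- function(original_gravity, final_gravity) {
  (original_gravity - final_gravity) * 131.25
}

# Hypothetical must at 1.090 fermented down to 0.995.
estimate_abv(1.090, 0.995)
```

The dataset records only the density of the finished wine, so this calculation cannot be reproduced from the sample itself.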

# density vs. alcohol(corr - -0.69)
# Correlation that exist between density vs. alcohol in our wine sample
p <- ggplot(wine, aes(x=density, y=alcohol, color = color)) +
  geom_point(alpha = 1/3, position = position_jitter(h = 0), size = 2) +  
  geom_smooth(method = 'lm') +
  ggtitle('Density vs. Alcohol correlation by Color')

p + coord_cartesian(xlim=c(min(wine$density),1.005), ylim=c(8,15))

plot of chunk plot2_cont

Description Two

From the graphs it is evident that wines with low density have high quality. Alcohol and density also have a strong negative correlation of -0.69.

Plot Three

# alcohol vs. quality_factor (corr - 0.44)
quplot(wine, "alcohol", "color", alcohol_mean, "Alcohol", "alcohol vs. quality_factor correlation by wine color")

plot of chunk Plot3

Now the impact of density and alcohol on quality of wine can be depicted as

#Our sample had wines in the quality level bucket 5 and 6

#Correlation of density vs. alcohol with respect to quality factor
rwcorrs(wine, "density","alcohol", "Density vs. Alcohol correlation to Quality_Factor")

plot of chunk Plot3_corr_anal

#Level of alcohol in our sample of wine that fall under different quality bucket
densp(wine,"alcohol","alcohol relationship with wine Quality")

plot of chunk Plot3_corr_anal

#Create a categorical variable to see why the correlation is only 0.44.

#create categorical variable to show different buckets of quality level
wine$quality.cut <- cut(wine$quality, breaks=c(0,4,6,10))

#Graph to show the correlation between alcohol vs density based on quality cut
ggplot(data=wine, aes(x=density, y=alcohol)) +
  coord_cartesian(
    xlim=c(quantile(wine$density,.01),quantile(wine$density,.99)),
    ylim=c(quantile(wine$alcohol,.01),quantile(wine$alcohol,.99))
    ) +
  geom_jitter(alpha=.5, aes(size=quality.cut, color=quality.cut)) +
  xlab("Density") +
  ylab("Alcohol") +
  ggtitle('density vs. alcohol correlation for wine sliced by quality')

plot of chunk Plot3_corr_anal

# Tabulate quality.cut to see how many wines in our sample fall in each bucket.
table(wine$quality.cut)
## 
##  (0,4]  (4,6] (6,10] 
##    246   4974   1277

Description Three

Our graphs and data indicate that higher alcohol content and lower density contribute to a good-quality wine, yet the correlation between quality and alcohol is not that strong (0.44).

To analyze this further, I created the quality.cut categorical variable and plotted the correlation.

The quality.cut graph suggests that the correlation is weaker because the majority of our wine sample falls in the (4,6] quality bucket.
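
The size of the middle bucket can be checked from numbers already printed in this report. The sketch below rebuilds the quality scores from the `table(wine$quality)` counts and applies the same breaks as `quality.cut`; the names `quality`, `bucket` and `counts` are introduced here for illustration.

```r
# Rebuild the quality scores from the counts printed by table(wine$quality).
quality <- rep(3:9, times = c(30, 216, 2138, 2836, 1079, 193, 5))

# Same breaks as quality.cut above.
bucket <- cut(quality, breaks = c(0, 4, 6, 10))
counts <- table(bucket)
counts

# Share of the sample in each bucket, in percent.
round(100 * prop.table(counts), 1)
```

Roughly three quarters of the wines sit in the (4,6] bucket, so the sample contains relatively few wines at the low and high ends of the quality scale.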

The link below describes the five key components of wine.

http://www.snooth.com/articles/five-key-wine-components-and-how-to-detect-them/?viewall=1

Reflection

The wine data set contains information on both red and white wine. I started by understanding the individual variables in the data set by plotting graphs and by visiting websites to see what each element contributes.

I then explored interesting questions and leads as I continued to make observations on the plots. Eventually, I explored the quality of wine based on density and alcohol.

It is interesting that, even though the graphs show that an increase in alcohol content is an indication of good-quality wine, the correlation between quality and alcohol is not strong.

On further analysis, I realized that the majority of the sample falls between quality 4 and 6 (which is average), so the correlation may not be a true reflection of the relationship.

The data should include more red wine samples so that the analysis does not favor the characteristics of one wine over the other.